AD/A-007 247


TIME  AND  FREQUENCY  RESOLUTION  IN 
SPEECH  ANALYSIS  AND  SYNTHESIS 

Aubrey  M.  Bush 

Georgia  Institute  of  Technology 


Prepared  for: 

Army Research Office - Durham


10  January  1975 


DISTRIBUTED  BY: 


National  Technical  Information  Service 
U. S. DEPARTMENT OF COMMERCE


UNCLASSIFIED
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE

1. REPORT NUMBER: E21-611-74-BU-2
4. TITLE (and Subtitle): Time and Frequency Resolution in Speech
   Analysis and Synthesis
5. TYPE OF REPORT & PERIOD COVERED: Final
7. AUTHOR(s): Aubrey M. Bush
8. CONTRACT OR GRANT NUMBER(s): DA-ARO-D-31-124-71-G126
9. PERFORMING ORGANIZATION NAME AND ADDRESS: Georgia Institute of
   Technology, School of Electrical Engineering
11. CONTROLLING OFFICE NAME AND ADDRESS: U.S. Army Research Office,
    Box CM, Duke Station, Durham, NC 27706
12. REPORT DATE: January 10, 1975
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling
    Office): Same
15. SECURITY CLASS. (of this report): UNCLASSIFIED
16. DISTRIBUTION STATEMENT (of this Report): Unlimited, Open Publication
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if
    different from Report): Same
18. SUPPLEMENTARY NOTES: Reproduced by National Technical Information
    Service, U S Department of Commerce, Springfield VA 22151
19. KEY WORDS: Digital Signal Processing, Speech Compression, Vocoders,
    Digitization, Speech Quality, Digital Speech Transmission
20. ABSTRACT

A problem basic to the development of all digital telecommunication
systems is the efficient digital encoding of speech signals for transmission.
Many speech digitization algorithms have been proposed. This study, using
as a research vehicle the homomorphic vocoder algorithm, is directed toward
determining the time resolution and the frequency resolution required to
faithfully reproduce a speech signal. The results reported here are
fundamental in that they are applicable to any speech digitization algorithm.

The resolutions required for typical speaker situations are summarized.

It was demonstrated that the dominant feature of speech
analysis-synthesis digitization algorithms is the pitch extraction; pitch
errors often lead to gross distortion of the reconstructed speech.


TIME  AND  FREQUENCY  RESOLUTION  IN  SPEECH 
ANALYSIS  AND  SYNTHESIS 


FINAL  REPORT 


JANUARY  10,  1975 


U. S. ARMY RESEARCH OFFICE - DURHAM


GRANT NO. DA-ARO-D-31-124-71-G126


GEORGIA  INSTITUTE  OF  TECHNOLOGY 
SCHOOL  OF  ELECTRICAL  ENGINEERING 
E21-611-74-BU-2 

APPROVED  FOR  PUBLIC  RELEASE 
DISTRIBUTION  UNLIMITED 


THE FINDINGS IN THIS REPORT ARE NOT TO BE
CONSTRUED AS AN OFFICIAL DEPARTMENT OF
THE ARMY POSITION, UNLESS SO DESIGNATED
BY OTHER AUTHORIZED DOCUMENTS.

CONTENTS 


STATEMENT OF THE PROBLEM
    Speech Production
    Spectrum Analysis
    Linear Prediction
    Time-Frequency Resolution
TECHNIQUES, TOOLS, AND PROCEDURES
    The Homomorphic Vocoder
    Subjective Testing
    The Simulation System
SUMMARY OF RESULTS
    Goal
    Conclusion One: Adequate Vocoder Time Resolution
    Conclusion Two: Time-Frequency Trading in Quality Perception
    Conclusion Three: The Effect of Reduced Frequency Resolution in
        Unvoiced and Transition Regions
    Conclusion Four: The Effect of the Adaptive Strategy
    Discussion
    Influence of Pitch Signal Quality
PUBLICATIONS AND TECHNICAL REPORTS
PARTICIPATING PERSONNEL
REFERENCES


STATEMENT  OF  THE  PROBLEM 


Speech Production [1]

The  sounds  of  human  speech  are  produced  when  the  vocal  tract  (an 
acoustic  cavity)  is  excited  by  a flow  of  air  from  the  lungs.  Voiced 
speech  results  when  air  is  forced  through  the  glottis  (the  opening 
between the vocal cords) while the vocal cords are held under tension.
The glottis oscillates, causing a quasi-periodic flow of air to excite
the  vocal  tract.  The  glottal  signal  is  a pulse-train  time-function, 
rich  in  harmonics.  The  fundamental  frequency  of  vocal  cord  oscillation 
is  called  the  voice  pitch. 

The excitation signal from the glottis passes through the vocal
tract, which includes the throat, mouth, and nasal cavity. The message
the talker wants to convey is imposed on the excitation signal by the
changes in position of the tongue, lips, and other moving parts of the
tract. These moving parts are called articulators, and their activity
in creating the spoken language is called articulation. During
articulation the vocal cavity assumes different positions, causing
resonances in the tract which alter the spectrum of the excitation signal,
imposing on the spectrum peaks which are called formants.

Unvoiced speech is produced by a turbulent flow of air past a
constriction in the vocal tract, or by a release of pressure at some point
of closure in the tract. Unvoiced excitation is an acoustic noise source.
The spectrum of an unvoiced speech sound is influenced mainly by that
portion of the vocal tract forward of the constriction. Pressure released
at a closure causes an initial burst, followed by turbulent flow noise.

The phoneme is the smallest unit of speech that distinguishes one
utterance from another. General American English has about 42 phonemes [1].
We may think of these phonemes as a code uniquely related to the
articulatory gestures of the language.

The  vowel  sounds  of  speech  are  produced  by  voiced  excitation  of  the 
vocal  tract  (e.g.,  the  "ah"  in  father).  In  normal  articulation  the  tract 
is  held  in  a relatively  stable  position  during  most  of  the  sound.  Vowels 
usually  have  a "duration"  of  60  ms  or  longer. 

The fricative consonant phonemes are produced by incoherent noise
excitation of the tract caused by turbulent air flow at a constriction
(e.g., the "s" in see). The vocal cord source may operate in conjunction
with the noise source to produce a voiced fricative (e.g., the "z" in
zoo).

Stop consonants are produced by the abrupt release of pressure at
a place of closure in the tract (e.g., the "t" in to). The articulatory
movements which generate stops are more rapid than for other sounds.
Stops may be voiced or unvoiced.

The remaining consonants are classified as nasals, glides,
semivowels, diphthongs, and affricates.

A key observation about the phonemes of speech is that some sounds
(e.g., the stop consonants) are produced by a rapid motion of the
articulators, while others (e.g., the vowels) are produced by a relatively
stable vocal tract configuration. The striking difference between the
character of "short" sounds and "long" sounds suggests that all sounds
should not be processed in the same manner.

Spectrum Analysis

The  traditional  tool  for  spectrum  analysis  of  signals  and  linear 
systems  is  the  Fourier  transform  pair: 

$$ S(f) = \int_{-\infty}^{\infty} s(t)\, e^{-j2\pi f t}\, dt, \qquad s(t) = \int_{-\infty}^{\infty} S(f)\, e^{+j2\pi f t}\, df \tag{1} $$

In  analyzing  a speech  signal,  however,  the  future  is  not  available,  and 
only  the  very  recent  past  is  of  interest.  So  we  adopt  the  short-time 
spectrum  S(t,f)  [1]: 

$$ S(t,f) = \int_{t-D}^{t} s(\tau)\, w(t-\tau)\, e^{-j2\pi f \tau}\, d\tau, \tag{2} $$

where  D is  the  duration  of  the  window  function  w(t).  The  short-time 
spectrum  is  the  Fourier  transform  of  the  recent  past  of  the  time  function 
s(t) weighted by the window function $w(t-\tau)$. Thus, S(t,f) describes the
distribution of energy in frequency, as it changes with time. The
generation and coding of this short-time spectrum are the central features of
most  speech  data  rate  reduction  systems. 

The short-time spectrum may be displayed with a sound spectrogram, a
representation of the time-frequency-intensity coordinates of
$|S(t,f)|$ [1,5]. The display is generated by playing a recorded passage
of speech (typically 2.4 seconds) through a narrow band-pass filter and
envelope detector. The contour $|S(t,f_n)|$ is "burned" on Teledeltos
paper for successive filter center frequencies $f_n$, with relative darkness
displaying intensity on a logarithmic scale. Two analyzing filter
bandwidths are commonly available: 45 Hz and 300 Hz. The choice of filter
bandwidth corresponds (roughly) to the choice of the duration of the
window function in (2).

In the narrow-band mode, the frequency resolution is sufficient to
display the voice pitch and its harmonics, but the time resolution is
relatively poor. In the wide-band mode, the time resolution is sufficient
to display individual glottal pulses, but the frequency resolution is
relatively poor. The formant structure of the speech signal may be
observed in voiced portions of the spectrograms.

Notice  the  effect  of  the  window  function  on  the  short-time  spectrum. 
The  scaling  property  of  the  Fourier  transform 

$$ s(t) \leftrightarrow S(f) \;\Rightarrow\; s(at) \leftrightarrow \frac{1}{|a|}\, S\!\left(\frac{f}{a}\right) \tag{3} $$

shows  that  as  the  effective  "duration"  of  a window  function  w(t)  is  made 
shorter,  the  "bandwidth"  of  its  spectrum  W(f)  is  broadened,  and  vice  versa. 
Since multiplication of the signal s(t) by the window $w(t-\tau)$ is equivalent
to convolving their spectra, it is clear that the spectrum of a sinusoid
viewed through a time window is broadened as the duration of the window
is decreased. The choice of a window function involves a compromise
between the time "resolution" and frequency "resolution" that may be
achieved in the short-time spectrum [5].
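
The compromise can be checked numerically. The following is a minimal sketch, not part of the original study: it windows a sinusoid with Hamming windows of three durations and measures the 3 dB width of the resulting spectral peak. The sampling rate (8 kHz), the tone frequency (1 kHz), the Hamming window, and the function name `mainlobe_width_hz` are all illustrative assumptions.

```python
import numpy as np

def mainlobe_width_hz(duration_s, fs=8000.0, tone_hz=1000.0, nfft=1 << 16):
    """3 dB width (Hz) of the spectral peak of a windowed sinusoid.

    fs, tone_hz and the Hamming window are assumptions for illustration.
    """
    n = int(round(duration_s * fs))
    t = np.arange(n) / fs
    x = np.cos(2 * np.pi * tone_hz * t) * np.hamming(n)
    # Zero-padding to nfft only interpolates the spectrum more finely.
    mag = np.abs(np.fft.rfft(x, nfft))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    # Frequencies where the magnitude is within 3 dB of the peak.
    above = freqs[mag >= mag.max() / np.sqrt(2.0)]
    return above.max() - above.min()

for d in (0.040, 0.020, 0.010):
    print(f"{d * 1e3:4.0f} ms window -> {mainlobe_width_hz(d):6.1f} Hz wide")
```

Each halving of the window duration should roughly double the measured width, as the scaling property (3) predicts.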

Digital  spectrum  analysis  is  accomplished  with  the  discrete  Fourier 
transform  (DFT)  pair: 


$$ S(kF) = \sum_{n=0}^{N-1} s(nT)\, e^{-j2\pi nk/N}, \qquad s(nT) = \frac{1}{N} \sum_{k=0}^{N-1} S(kF)\, e^{+j2\pi nk/N} \tag{4} $$

where T is the sampling interval of the time function s(t), N is the
number of samples to be transformed, and $F = 1/(NT)$ is the sampling
interval of the spectrum [8].

To  obtain  the  discrete  short-time  Fourier  transform,  we  introduce 
the window function into (4) [5]:

$$ S_r(kF) = \sum_{n=0}^{N-1} w(nT)\, s(nT + rMT)\, e^{-j2\pi nk/N}. \tag{5} $$

The index r corresponds to the time variable in (2). The short-time
spectrum is evaluated at times t = rMT, for r = 0, 1, 2, .... The window
is propagated along the time function s(t) in steps of MT seconds. The
samples of (5) represent the samples of (2) to within a phase constant,
i.e.,

$$ |S_r(kF)| = |S(rMT + D,\, kF)|. \tag{6} $$

The discrete short-time Fourier transform may be efficiently computed
digitally with the Fast Fourier Transform algorithm. [8][9]
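
As an illustration of (5), the sketch below frames a sampled signal, applies a window, and takes an FFT of each frame. It is a minimal example under stated assumptions rather than the simulation code used in this work: the Hamming window, the 8 kHz sampling rate, the 500 Hz test tone, and the function name `short_time_spectrum` are all chosen here for illustration.

```python
import numpy as np

def short_time_spectrum(s, N, M, window=None):
    """Evaluate (5): S_r(kF) for every complete frame r and index k.

    s      : samples s(nT)
    N      : window length in samples (D = N*T)
    M      : frame step in samples (frame r starts at t = r*M*T)
    window : w(nT); a Hamming window is assumed if none is given
    """
    s = np.asarray(s, dtype=float)
    w = np.hamming(N) if window is None else np.asarray(window, dtype=float)
    n_frames = 1 + (len(s) - N) // M          # complete frames only
    frames = np.stack([s[r * M : r * M + N] for r in range(n_frames)])
    return np.fft.fft(frames * w, axis=1)     # row r holds S_r(kF)

fs = 8000.0                                   # assumed sampling rate, T = 1/fs
t = np.arange(int(0.1 * fs)) / fs
S = short_time_spectrum(np.cos(2 * np.pi * 500.0 * t), N=160, M=80)
F = fs / 160                                  # F = 1/(N*T)
print(S.shape, "peak at", np.argmax(np.abs(S[0, :80])) * F, "Hz")
```

Here N = 160 and M = 80 correspond to a 20 ms window advanced in 10 ms steps, so successive windows overlap by half.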

The  short  time  spectrum  can  also  be  plotted  by  digital  means  either 
for  display  on  a CRT  or  a Calcomp  or  similar  plotter,  using  "3-D"  plotting 
routines.  In  this  case,  a wider  choice  of  parameters  is  available. 

Linear  Prediction 

Techniques for speech analysis which separate the vocal tract response
function and the excitation signal using time domain techniques, rather
than concentrating initially on a short-time spectrum, are generally
lumped into a class of systems referred to in the literature as "LPC's"
or linear predictive coders. [10] This class of speech analysis operates
as one of several possible forms of least squares prediction algorithms. [11]
The vocal tract is assumed to be an all-pole filter, excited by an impulse
train or white noise. The prediction algorithm is chosen to locate the
poles of the vocal tract filter. The vocal tract prediction is then
subtracted from the signal to leave a residual containing primarily
excitation information. Although algorithms which seek to determine
structural details of filters are in general nonlinear [12], utilization
of an inverse filter format allows this problem to become a linear
optimization problem. [9]

The  input  signal,  preparatory  to  the  prediction,  is  first  placed 
in  digital  format.  Then,  depending  on  the  particular  prediction  chosen, 
some  form  of  windowing  strategy  is  chosen.  The  optimization  may  be  either 
a block algorithm or a point-by-point algorithm. If the block strategy,
which  is  the  most  common  technique,  is  selected,  a window  and  framing 
operation  must  be  defined  explicitly.  The  window  may  be  rectangular, 
Hamming,  Hanning  [13],  etc.  The  frame  interval  is  usually  equal  to  the 
window  length  but  may  be  longer  or  shorter.  There  may  be  an  overlap  of 
windows  in  the  framing  process.  A variety  of  initialization  techniques 
may  be  used  at  each  step  [14].  If  the  point-by-point  [15]  approach  is 
chosen,  the  algorithm  itself  will  implicitly  define  a windowing  of  the 
signal.  In  this  case,  the  window  is  a sliding  window,  and  is  generally 
an  exponential  window. 
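
To make the block strategy concrete, the following sketch applies a Hamming window to one frame, forms its autocorrelation, and solves the resulting normal equations for the predictor coefficients of an all-pole model. This is the autocorrelation formulation, one common least squares variant; this report does not say which formulation any particular system used, and the function names, the Hamming window, and the test signal are assumptions made for illustration.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Least squares predictor coefficients a[1..p] for one windowed frame.

    Minimizes the energy of x[n] - sum_k a[k] x[n-k] (autocorrelation method).
    """
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    # Autocorrelation lags r[0..p] of the windowed frame.
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1..p].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

def residual(frame, a):
    """Subtract the prediction from the signal, leaving the excitation estimate."""
    x = np.asarray(frame, dtype=float)
    pred = np.zeros_like(x)
    for k, ak in enumerate(a, start=1):
        pred[k:] += ak * x[:-k]
    return x - pred

# Hypothetical test signal: white noise through a known two-pole filter.
rng = np.random.default_rng(0)
e = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
a = lpc_autocorrelation(x, order=2)
print("estimated coefficients:", a)
```

The window length and frame step passed to such a routine fix the time resolution of the analysis, while the predictor order limits the spectral detail the all-pole model can represent.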

Extensive work on the prediction algorithms has been undertaken and
reported in the literature. [16],[17],[10],[11],[15],[18]-[20] Attention has
not been directed toward best windowing and framing strategies. It is not
generally appreciated that, even though the LPC is a "time domain"
approach, a frequency resolution and a time resolution are nevertheless
determined by the parameters chosen in the algorithm.

Time-Frequency Resolution

The  primary  concern  in  this  research  has  been  to  determine,  via 
subjective  listening  tests  in  a controlled  laboratory  environment,  the 
time  and  frequency  resolution  required  to  faithfully  reproduce  speech 
signals. 

This was accomplished by processing speech signals using as a
research vehicle the homomorphic vocoder described below.

TECHNIQUES, TOOLS, AND PROCEDURES

The Homomorphic Vocoder

The spectrum of a quasi-stationary segment of speech is the product
of the excitation spectrum E(f) and the vocal tract system function H(f).

$$ s(t) \leftrightarrow S(f) = E(f)\, H(f) $$

The logarithm of the amplitude spectrum $|S(f)|$ is

$$ \ln|S(f)| = \ln|E(f)| + \ln|H(f)| \tag{7} $$

in which the influences of source and vocal tract are additive. Taking
the Fourier transform of $\ln|S(f)|$ yields the so-called "cepstrum"
$C(\tau)$ [4],[8]:

$$ C(\tau) = \mathcal{F}\{\ln|S(f)|\} = \int_{-\infty}^{\infty} \ln|S(f)|\, e^{-j2\pi f\tau}\, df = \mathcal{F}\{\ln|E(f)|\} + \mathcal{F}\{\ln|H(f)|\}. \tag{8} $$

During voiced sounds the two components of the cepstrum occupy different
regions in the "quefrency" variable $\tau$.

Since $|H(f)|$ is a "smooth" function of frequency, so is $\ln|H(f)|$,
and the contribution to the cepstrum is essentially confined to the
low-quefrency region on the $\tau$ axis. On the other hand, for a voiced sound
$|E(f)|$ is essentially periodic in f (with peaks separated by the pitch
frequency $f_p$) so the contribution to the cepstrum is a spike at
$\tau_0 = 1/f_p$, the pitch period.

It is clear that to the extent to which the cepstral contributions of
E(f) and H(f) are disjoint in quefrency $\tau$, the vocal tract influence may
be isolated by low quefrency filtering, while the excitation may be
characterized by a measure of the pitch period $\tau_0$ in the high quefrency
region. This describes the strategy of the homomorphic vocoder.

It has been reported that samples of the cepstrum out to $\tau$ = 3 ms
are adequate for speech signals. [4] Thus, in the digital homomorphic
vocoder, the vocal tract spectrum is encoded by the low quefrency cepstrum
samples, while voicing information is transmitted as a voiced/unvoiced
decision and the pitch frequency determined from the peak in the
cepstrum.
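
The analysis half of this strategy can be sketched in a few lines. The code below is illustrative only and is not the vocoder implementation used in this study: it computes a real cepstrum of one frame with the FFT, keeps the samples out to a 3 ms cutoff as the vocal tract description, and takes the largest cepstral value above the cutoff as the pitch period estimate. The Hamming window, the 8 kHz sampling rate, the 20 ms upper search limit, the synthetic test frame, and the function name `cepstral_analysis` are all assumptions; the voiced/unvoiced decision is omitted.

```python
import numpy as np

def cepstral_analysis(frame, fs, cutoff_ms=3.0, max_pitch_ms=20.0):
    """Split one frame into low-quefrency samples and a pitch period estimate.

    Returns (low-quefrency cepstrum, pitch period in seconds, peak value).
    The search limits are illustrative assumptions.
    """
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    nfft = 1 << int(np.ceil(np.log2(2 * len(x))))
    log_mag = np.log(np.abs(np.fft.fft(x, nfft)) + 1e-12)
    # ln|S| is real and even, so the inverse FFT differs from the forward
    # transform in (8) only by a scale factor. Index q is quefrency q/fs.
    c = np.fft.ifft(log_mag).real
    n_low = int(round(cutoff_ms * 1e-3 * fs))
    low = c[: n_low + 1]                      # vocal tract portion
    lo = n_low + 1                            # search above the cutoff
    hi = min(int(round(max_pitch_ms * 1e-3 * fs)), nfft // 2)
    q_peak = lo + int(np.argmax(c[lo : hi + 1]))
    return low, q_peak / fs, c[q_peak]

# Hypothetical voiced frame: impulse train (period 50 samples) driving a
# single decaying resonance, analyzed over 40 ms at 8 kHz.
fs = 8000
excitation = np.zeros(1200)
excitation[::50] = 1.0
h = np.exp(-np.arange(200) / 20.0) * np.sin(2 * np.pi * 700 * np.arange(200) / fs)
speech = np.convolve(excitation, h)[:1200]
low, period, peak = cepstral_analysis(speech[400:720], fs)
print(len(low), "low-quefrency samples; pitch period", period * 1e3, "ms")
```

Starting the pitch search just above the cutoff mirrors the assumption in the text that the two cepstral components are disjoint; a pitch period shorter than the cutoff would violate it.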

A block  diagram  of  the  homomorphic  vocoder  is  shown  in  Figure  1. 

Typical  waveforms  encountered  in  this  vocoder  are  shown  in 
Figure  2. 

The  time  and  frequency  resolution  of  the  homomorphic  analysis 
technique  are  directly  controlled.  A detailed  discussion  has  been 
developed  by  Patisaul. [21]  The  time  resolution  is  determined  by  the 
width  of  the  window  in  the  time  domain.  The  frequency  resolution  is 
determined  by  the  length  of  the  window  in  the  cepstral  domain.  Under 
suitable conditions, generally appropriate to speech analysis, [21] the
time and frequency resolutions are controlled essentially independently.

Figure 1. The Homomorphic Vocoder.

Subjective Testing

A set of four test sentences taken from the Harvard list of
phonetically balanced sentences, spoken by a male and by a female speaker,
was used as source material. These sentences were processed by a
homomorphic vocoder utilizing a wide range of windows in both the time and
cepstral domains. Fixed windowing strategies utilizing a given
combination of windows throughout were studied, as well as adaptive
strategies using one combination of windows for voiced segments and another
for unvoiced segments. Each combination of windows is referred to as a mode
of operation of the vocoder.

A category  preference  test  was  chosen  for  evaluation  of  the  test 
sentence quality. The results of the category preference judgements were
analyzed in great detail to determine significant features.

Each  listener  in  the  Category  Judgement  test  is  required  to  rank  the 
test  sentences  on  a scale  of  0-8  compared  to  a "good"  and  "poor"  sample 
which  are  presented  periodically  throughout  the  test  for  reference.  A 
mean  category  judgement  (MCJ)  can  be  obtained  by  averaging  across  all 
listeners. 

This  testing  technique  is  felt  to  be  nearer  real  world  conditions 
than  tests  involving  direct  one-to-one  comparisons. 

The  Simulation  System 

The  system  used  to  implement  the  homomorphic  vocoder  and  conduct  the 
listening  tests  has  been  developed  over  the  past  three  years. 

Initially, simulations were run on a central computer facility, a
Univac U1108. Input was accomplished via a Radiation, Inc. A/D unit at
the central facility; output was taken off on paper tape and D/A
conversion accomplished on a PDP-8 minicomputer located at another site.

Improvement was made by the middle of the first year of the project
by substitution of a magnetic tape link for the paper tape link. A
Honeywell H-316 was used in place of the PDP-8.
However, input/output constraints were still so severe that no user
interaction could be allowed and progress in running simulations was
extremely slow.

During  the  second  year  of  the  project  a dedicated  minicomputer 
system was secured for our speech research work. Substantial effort was
devoted  to  the  development  of  this  system  over  this  period. 

The  system  is  uniquely  suited  to  speech  research.  It  is  highly 
interactive  and  has  the  memory  and  peripherals  required  to  efficiently 
conduct  speech  research. [22] 

All  of  the  results  reported  below  were  obtained  in  the  third  year 
of  the  project  using  this  dedicated  minicomputer  facility. 

Subjective testing was carried out using a professional quality
audio system in a room having a controlled environment, connected to
the computer room via coaxial cables.

SUMMARY OF RESULTS

Goal 

The goal of this research was to study the effects of vocoder
time-frequency resolution on the perceived quality of vocoder speech. This
section sets forth several conclusions concerning time-frequency
resolution and speech quality. These conclusions were drawn from the results
of the Category Judgement evaluation of the vocoded speech signals obtained
from the simulation phase of the research.

It is hoped that these conclusions will contribute to the further
understanding of the speech perception process and that they will point
the way toward improvements in vocoder design. In addition, these
conclusions should suggest areas for further research.

Conclusion One: Adequate Vocoder Time Resolution

Figure 3 shows the MCJ's for the nonadaptive vocoder plotted as a
function of frame interval and cepstrum length. For each condition of
frequency resolution, the two shorter frames showed an advantage in
quality over the 40.0 ms frame. The performances of the 10.0 ms frame and
the 20.0 ms frame were comparable for all cepstrum lengths except 4.0 ms,
where the 20.0 ms frame was significantly better. This difference is
unexplained but may be the result of some sort of time-frequency trading or
simply a "quirk" in the data. It should be noted that the quality at a
cepstrum length of 1.0 ms was essentially the same for all three frames,
suggesting that poor frequency resolution was the predominant factor at
that point.


Figure 3. Performance of Nonadaptive Configurations (MCJ versus cepstrum
length in ms).

The conclusion to be drawn from this set of results is that
maintaining time resolution better than about 20.0 ms seems to provide no
improvement in speech quality.

Conclusion Two: Time-Frequency Trading in Quality Perception

An examination of the nonadaptive vocoders with 20.0 ms and 40.0 ms
frames in Figure 3 shows that configurations with equivalent data rates
(for example, 40.0 ms frame and 4.0 ms cepstrum compared to 20.0 ms frame
and 2.0 ms cepstrum) have roughly equivalent quality. This observation
may be interpreted as evidence of time-frequency trading in speech
perception.
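
The equivalence of data rates in the example can be verified with simple arithmetic: the number of cepstrum samples sent per second is the number of samples per frame times the number of frames per second. The sketch below is a hedged illustration only. The 10 kHz sampling rate is an assumption (it is not stated in this section), the function name is hypothetical, and the count ignores quantization as well as the pitch and voicing parameters.

```python
def cepstrum_samples_per_second(frame_ms, cepstrum_ms, fs_hz=10000.0):
    """Vocal tract parameters per second for one nonadaptive configuration.

    fs_hz is an assumed sampling rate; it cancels out of any ratio of rates.
    """
    samples_per_frame = cepstrum_ms * 1e-3 * fs_hz
    frames_per_second = 1000.0 / frame_ms
    return samples_per_frame * frames_per_second

rate_40_4 = cepstrum_samples_per_second(40.0, 4.0)
rate_20_2 = cepstrum_samples_per_second(20.0, 2.0)
print(rate_40_4, rate_20_2)
```

Both configurations give the same rate because the rate depends only on the ratio of cepstrum length to frame interval; halving both leaves it unchanged.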

Conclusion Three: The Effect of Reduced Frequency Resolution in Unvoiced
and Transition Regions

Figures 4 and 5 show plots of the MCJ's for the various adaptive
vocoder configurations. Note that reducing the Mode 2 cepstrum length
had no appreciable effect on the speech quality. This result was
independent of Mode 1 frame, Mode 1 cepstrum, and Mode 2 frame. The
conclusion is that frequency resolution can be reduced considerably in
unvoiced regions and regions of voiced-unvoiced or unvoiced-voiced
transition with little or no loss in speech quality. This is an important
result for the design of adaptive vocoders which must maintain a constant
data rate. This conclusion also lends support to the notion of
time-frequency trading in speech perception.

Conclusion  Four:  The  Effect  of  the  Adaptive  Strategy 

The effect of the adaptive strategy is displayed in Figure 6. This
figure gives a comparison of the MCJ's for several versions of the
adaptive and nonadaptive vocoders. For the 20.0 ms frame at a fixed cepstrum
length, adapting to the 10.0 ms frame for unvoiced and transition regions
resulted in no improvement in quality. This observation is in agreement
with Conclusion One and suggests that time resolution of about 20.0 ms is
sufficient for vocoders.

For  the  40.0  ms  frame  and  4.0  ms  cepstrum,  adaption  has  no  noticeable 
effect.  However,  for  the  40.0  ms  frame  and  either  the  3.0  ms  or  2.0  ms 
cepstrum,  adapting  to  a shorter  frame  in  unvoiced  and  transition  regions 
led  to  improved  quality.  The  improvement  obtained  was  independent  of  the 
Mode  2 frame  used,  giving  still  more  support  to  Conclusion  One.  The 
improvement  was  roughly  equivalent  to  increasing  the  cepstrum  length  of 
the  nonadaptive  vocoder  by  1.0  ms.  It  is  not  clear  why  adaption  produced 
no  improvement  in  quality  for  the  4.0  ms  cepstrum  case. 

It  appears  that  20.0  ms  time  resolution  is  adequate  for  vocoder 
applications.  Thus  adaption  in  a system  which  normally  maintains  20.0  ms 
resolution  or  better  yields  no  improvement  in  performance.  For  systems 
that  do  not  normally  employ  such  good  time  resolution,  adaption  seems  to 
offer  considerable  potential. 

It should be pointed out that interpolation of impulse responses
would probably improve the performance of the vocoders using 40.0 ms
frames in voiced regions. Thus a combination of interpolation and
adaption in these systems might well produce good quality speech at quite
low data rates.

Discussion 

The conclusions presented above should be regarded as tentative for
several reasons. Only two speakers and four utterances were used in the
research. Only one type of vocoder was employed and speech quality was
judged by only one of several methods available. It is certainly
conceivable that a similar study with different source material, a different
vocoder, or a different quality measurement technique could produce
different results.

At  an  early  stage  in  this  research  it  became  evident  that  the  two 
speakers  used  produced  vocoded  speech  of  widely  differing  quality.  The 
male  speaker  rated  higher  in  quality  than  the  female  in  all  instances 
except  three.  This  dependence  of  quality  on  the  speakers  may  have 
colored  the  results  of  this  study. 

If  a speaker  sex  dependent  quality  difference  does  exist  in  vocoder 
speech,  it  may  be  possible  to  explain  it  in  terms  of  fundamental  periods. 
Females  typically  have  shorter  fundamental  periods  than  males  so  that 
there  is  more  overlap  of  impulse  responses  during  voiced  speech  and 
deconvolution  is  more  difficult.  An  equivalent  explanation  can  be  given 
in  the  frequency  domain.  Since  female  speakers  generally  have  shorter 
fundamental  periods,  the  spectral  lines  in  the  short-time  spectrum  are 
spaced  further  apart  so  that  the  envelope  due  to  the  vocal  tract  is 
"sampled"  less  often  in  frequency  making  deconvolution  by  smoothing  more 
difficult. In the particular case of the cepstrum vocoder, short
fundamental periods cause the excitation portion of the cepstrum to encroach
on  the  vocal  tract  portion  with  a resulting  loss  of  quality  in  the 
deconvolution. 
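
A small numerical illustration of both forms of the argument follows. It is a sketch with assumed values, not data from this study: the 8 ms and 4 ms fundamental periods stand in for a lower-pitched and a higher-pitched voice, the 4 kHz bandwidth is assumed, and the function names are hypothetical.

```python
def harmonic_spacing_hz(pitch_period_ms):
    """Spacing of the spectral lines in the short-time spectrum."""
    return 1000.0 / pitch_period_ms

def envelope_samples(pitch_period_ms, bandwidth_hz=4000.0):
    """Number of harmonics that 'sample' the vocal tract envelope in the band."""
    return int(bandwidth_hz // harmonic_spacing_hz(pitch_period_ms))

def excitation_encroaches(pitch_period_ms, cepstrum_length_ms):
    """True when the pitch spike falls inside the retained cepstrum window."""
    return pitch_period_ms <= cepstrum_length_ms

for period_ms in (8.0, 4.0):          # assumed illustrative pitch periods
    print(period_ms, "ms period:",
          harmonic_spacing_hz(period_ms), "Hz spacing,",
          envelope_samples(period_ms), "harmonics,",
          "encroaches on a 4.0 ms cepstrum:",
          excitation_encroaches(period_ms, 4.0))
```

Halving the fundamental period halves the number of points at which the envelope is sampled in frequency, and brings the pitch spike down to the edge of the longest cepstrum window considered here.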

Influence  of  Pitch  Signal  Quality 

Pitch, for the purposes of speech signal analysis, can be defined as
the spacing between impulses in the excitation signal during voiced
segments. This definition, though adequate for speech signals, ignores
effects observed by some in the perception of pitch of higher frequency
complex tones. [23]

The test results obtained in the study of time-frequency resolution [21]
and  summarized  above  utilize  a "hand-painted"  or  perfect  pitch  excitation 
signal. 

Hand-painting is done best by interactively studying the input
speech signal using the dedicated interactive simulation facility. [24]
A segment of speech is displayed on a CRT, together with displays of its
spectrum,  the  output  of  a cepstrum  pitch  estimator,  the  output  of  an 
autocorrelation  pitch  estimator,  and  the  output  of  a minimum  difference 
pitch  estimator.  The  operator  then  examines  the  time  waveform  and  all 
the  available  pitch  estimates  and  inputs  his  own  choice  of  pitch  period 
for  that  segment  based  on  all  available  information. 

The perceived speech quality of the synthesized speech, not only for
the homomorphic vocoder but for any vocoder or LPC algorithm for speech
synthesis, is found to be more sensitive to the quality of the pitch
signal than to any other single parameter. Perturbations in the pitch
signal produced by known methods of pitch estimation consist of small
local offsets and large global errors or jumps. The large errors are often
harmonically related as doubling, tripling, or halving of the pitch period.
Local errors can cause a "wavering" quality or uncertainty in the
synthetic speech. Global errors can cause total distortion including
loss of intelligibility as well as reduced quality. Synthetic speech
utilizing artificial pitch information, as in reading machines, produces
a very unnatural or machine-like quality.
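
The two kinds of perturbation can be told apart by comparing an estimated pitch track against a reference track frame by frame. The sketch below is illustrative only and was not part of this study: the tolerance values, the label names, and the function `classify_pitch_errors` are assumptions, and the reference track stands in for a hand-painted one.

```python
def classify_pitch_errors(reference_ms, estimate_ms,
                          match_tol=0.02, local_tol=0.20, harmonic_tol=0.10):
    """Label each frame by the ratio of estimated to reference pitch period.

    All tolerances are illustrative assumptions, expressed as fractions.
    """
    harmonic = {"doubling": 2.0, "tripling": 3.0, "halving": 0.5}
    labels = []
    for ref, est in zip(reference_ms, estimate_ms):
        ratio = est / ref
        if abs(ratio - 1.0) <= match_tol:
            labels.append("match")
        elif abs(ratio - 1.0) <= local_tol:
            labels.append("local offset")      # small local error
        else:
            # Large error: check for a harmonic relationship first.
            for name, target in harmonic.items():
                if abs(ratio / target - 1.0) <= harmonic_tol:
                    labels.append(name)
                    break
            else:
                labels.append("other gross error")
    return labels

reference = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0]     # hypothetical periods in ms
estimate = [8.0, 8.4, 16.1, 4.0, 24.0, 11.0]
print(classify_pitch_errors(reference, estimate))
```

A count of the frames in each class gives a rough picture of whether an estimator is likely to produce the "wavering" quality of local errors or the gross distortion of global ones.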

PUBLICATIONS AND TECHNICAL REPORTS

During the period covered by this grant a number of publications
and reports have been generated. These are listed below. Additional
support for speech related work was obtained during the grant period.
The list of reports includes the results of these studies as well.

To  be  submitted  for  publication: 

"A  Time-Frequency  Resolution  Experiment  in  Speech  Analysis 
and  Synthesis,"  C.  R.  Patisaul  and  J.  C.  Hammett. 

Submitted for publication:

"The Multiband Pitch Detector," C. R. Patisaul and
T. P. Barnwell.

Presented at Conferences and Published in Conference Record:

"Gapped ADPCM for Speech Digitization," T. P. Barnwell and
A. M. Bush, NEC '74 Conference Record, October, 1974.

"A Minicomputer Based Digital Signal Processing System,"
T. P. Barnwell and A. M. Bush, EASCON '74 Conference
Record, October, 1974.

Theses:

"Adaptive Time-Frequency Resolution in Vocal Tract
Parameter Coding for Speech Analysis and Synthesis,"
C. R. Patisaul, Ph.D. Thesis, June, 1974.

Reports:

"Adaptive Differential PCM Speech Transmission," T. P.
Barnwell, A. M. Bush, J. B. O'Neal, and R. W. Stroh,
RADC-TR-74-177, Final Report, July, 1974.

"Pitch and Voicing in Speech Digitization," T. P. Barnwell,
J. E. Brown, and C. R. Patisaul, Georgia Institute of
Technology, School of Electrical Engineering, Research
Report E21-620-74-BU-1, August, 1974.

"Recursive Algorithms for Data Processing," J. E. Brown,
Georgia Institute of Technology, School of Electrical
Engineering, Internal Memorandum Report, August, 1974.